-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Genesis milestone 13: documentation / performance optimization #1113
Conversation
980c35a
to
ce3b2c6
Compare
241f00d
to
bbcd6e3
Compare
db11bc1
to
2acc4b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments here and there. Also:
- there are hlint warnings, please fix them before merge
- There are commits which are Unverified, I doubt branch protection rules will allow you to merge this.
ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/LeakyBucket.hs
Outdated
Show resolved
Hide resolved
ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/LeakyBucket.hs
Show resolved
Hide resolved
ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/LeakyBucket.hs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this file will result in the same as the ResourceRegistry: we have an isolated component that does not depend on consensus, then it becomes difficult (in terms of bureaucracy) extracting it. I wonder if this should be a separate library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all of our abstractions initially started in Consensus (eg the Fuse
you added most recently); but we could change that and put it at least into a different library. Is it fine to do that later in a separate PR?
ouroboros-consensus-diffusion/test/consensus-test/Test/Consensus/Genesis/Tests/CSJ.hs
Outdated
Show resolved
Hide resolved
ouroboros-consensus-diffusion/test/consensus-test/Test/Consensus/Genesis/Tests/CSJ.hs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't the GSM and GsmState live under a Node/Genesis/.
directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The GSM is also used for pre-Genesis bootstrap peers, but yeah, we could move it in that subdirectory.
- The
GsmState
can't be moved there as it is used in the ChainSync client:
ouroboros-consensus/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Node/GsmState.hs
Lines 4 to 6 in 6b44bfb
-- | This module contains the definition of a state in the Genesis State Machine -- (GSM). The GSM itself is defined in 'ouroboros-consensus-diffusion', but the -- ChainSync client relies on its state.
@@ -247,7 +253,7 @@ initNodeKernel args@NodeKernelArgs { registry, cfg, tracers | |||
, GSM.setCaughtUpPersistentMark = \upd -> | |||
(if upd then GSM.touchMarkerFile else GSM.removeMarkerFile) | |||
gsmMarkerFileView | |||
, GSM.writeGsmState = \gsmState -> | |||
, GSM.writeGsmState = \gsmState -> do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unneeded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this about the extraneous do
at the end of the line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, GSM.writeGsmState = \gsmState -> do | |
, GSM.writeGsmState = \gsmState -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in general a bit concerned that Genesis seems to permeate into ouroboros-consensus and ouroboros-consensus-diffusion with subtle subtleties on which concepts go into each package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, partly, I think this is because we don't really have a great distinction already in main
(eg why does the forging loop live in ouroboros-consensus-diffusion
?). We might think about this and shuffle some things around, but probably in a later PR in case that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored with @jasagredo
ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Genesis/Governor.hs
Outdated
Show resolved
Hide resolved
@@ -194,8 +237,14 @@ data DensityBounds blk = | |||
-- ChainSync instruction the peer sent, and whether the peer is idling (i.e. it | |||
-- sent @MsgAwaitReply@). | |||
-- | |||
-- @loeFrag@ is the fragment from the immutable tip to the first intersection | |||
-- with a candidate fragment. | |||
-- @loeFrag@ is the fragment anchored at the immutable tip and ending in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, variable loeFrag
is not visible in the resulting haddock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is introduced in the first line of this comment.
-- | @densityDisconnect genWin k states candidateSuffixes loeFrag@
-- yields the list of peers which are known to lose the density comparison with
-- any other peer, when looking at the genesis window after @loeFrag@.
-- ...
Let me know if that doesn't work still.
-- | ||
-- This type indicates whether the feature is enabled, and contains a value | ||
-- if it is. | ||
-- This type indicates whether LoE is enabled, and contains a value if it is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explain what this value represents and give some examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of a
depends on the context on which the LoE is used. There is no intended use at the definition.
LoE a
is isomorphic to Maybe a
, with the added meaning that Just
is only used if LoE is enabled. I elaborated this point in 20d86b3.
@@ -649,7 +648,7 @@ chainSelectionForBlock cdb@CDB{..} blockCache hdr punish = electric $ do | |||
-- ^ The current chain and ledger | |||
-> LoE (AnchoredFragment (Header blk)) | |||
-- ^ LoE fragment | |||
-> LoELimit | |||
-> LoE Word64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which cases will LoE
be parametrized with other types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example here:
type GetLoEFragment m blk = m (LoE (AnchoredFragment (Header blk)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this file not part of ouroboros-consensus
like the GSM state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good answer (note that the GSM was introduced in #808, not in this PR), also see #1113 (comment)
-- Evaluating the GDD rule might cause peers to be disconnected if they have | ||
-- sparser chains than the best chain. | ||
-- | ||
-- The LoE fragment is the fragment anchored at the immutable tip and ending at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the reference to LoE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit of the comment has been removed, but there is also a reference for LoE in the module header.
import Ouroboros.Network.AnchoredFragment (AnchoredFragment) | ||
import qualified Ouroboros.Network.AnchoredFragment as AF | ||
|
||
-- We have multiple other Genesis-related types of a similar shape ('LoE', LoP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a REVIEW
or TODO
keyword maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment has been removed.
, gcsCSJConfig :: !CSJConfig | ||
} | ||
|
||
-- TODO justification/derivation from other parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we want to address this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably after "real world" performance benchmarks on resource-limited machines. For now, the values are conservative; tuning them could be one part of a phased rollout of Genesis.
...os-consensus-diffusion/src/ouroboros-consensus-diffusion/Ouroboros/Consensus/Node/Genesis.hs
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,4 @@ | |||
### Breaking | |||
|
|||
- Internal refactorings in the CSJ and GDD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the same grammatical form across all changelogs (at least within the same PR :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, addressed 👍
This also involves some fairly important changes to the leaky bucket and to the leaky bucket API. These changes make the leaky bucket significantly more robust. Co-authored-by: Alexander Esgen <alexander.esgen@iohk.io>
* Rewrite `Peers` to accept arbitrary number of peers * Actually generate honest peers in CSJ happy path * Support a field for extra honest peers in `GenesisTest` * Allow `uniformPoints` to generate schedules with multiple honest peers * Adapt CSJ test to use native multiple honest peers generation * Share partial accessor functions used in tests * Use partial accessor to retrieve the only honest peer
Also: * Remove ill-defined LoELimit * Remove unused updateLoEFragStall * Rename runGdd to runGDDGovernor * Print GDD in traces instead of GDG * Remove the UpdateLoEFrag callback * Reorder functions in the Governor module
…' and disable timeouts as they are supposed to be
Also: * LoEEnabled: make payload strict to avoid NoThunks failures * LoE: allow to dynamically en-/disable * Depending on the state of the GSM, we want to either en- or disable the LoE. * Refactor `runGdd` to use `Watcher`, share trigger logic Previously, we would duplicate the logic for when to trigger the GDD between the NodeKernel and the peer simulator. * Add GDD tracing The `Show` instances are probably way too large ATM, but there are currently unused, so that isn't a pressing concern. * LoP rate: fix typo 500/s = 1/2ms, not 2/ms 🤦
Previously, it wasn't possible to eg run *just* CSJ.
Modify the honest shrinking function by no longer speeding up the other schedules when an honest tick is deleted. This simplifies a lot of code in the `Shrinking` module, at the cost of no longer ensuring that shrunk schedules preserve the overall order of events. Additionally, we re-enable shrinking in CSJ tests Also: * Extend adversarial schedules when shrinking an honest one * Document the cases when we don't use shrinking
I rebased |
Co-authored-by: Javier Sagredo <javier.sagredo@iohk.io>
Co-authored-by: Damian Nadales <damian.only@gmail.com> Co-authored-by: Javier Sagredo <javier.sagredo@iohk.io>
Fixup for commit 9011106 Rework `followsLoEFrag` into `trimToLoE`
I'm labeling "no changelog" since I don't think the changes to |
6bcfdc9
to
5680c0b
Compare
5680c0b
to
53ea76c
Compare
This PR improves the code documentation of Genesis, and implements some fixes and optimizations.
Documentation, refactoring, and testing
The code documentation of GDD, CSJ, and LoE is better connected in cb5b69a.
Additionally, cd032b1 brings small changes to the LoE, motivated by issues found while integrating the LoE into the ChainDB q-s-m tests (see #1119).
2f7152b enables CSJ in all of the uniform tests and tests run more tries now.
Shrinking of peer schedules has been simplified in 627634f, and 6b44bfb adds an extra field to the
PointSchedule
type to specify the time at the end of a test, allowing to clean up a lot of schedules.Profiling and optimizations
We consider our baseline, the fastest possible syncing, to be the scenario in which the syncing peer connects to a caught up peer and syncs using Praos.
From the profiling of Genesis on mainnet data, the GDD governor and chain selection was revealed as one of the major sinks of CPU time when enabling Genesis and syncing from multiple peers (between 2 and 30 peers).
Chain selection has been made idempotent again, to save on useless recomputations in 82376e6. Idempotency had been lost in #1015. 64f3da3 also calls to change selection less often, by avoiding calls when the LoE fragment doesn't change in a significant way.
The overhead of GDD is eliminated in a592c65 by replacing a frequent call to
AnchoredFragment.takeWhileOldest
with a faster variant.With these changes, the only significant overhead of syncing from multiple peers comes from the BlockFetch logic, which is the topic of ongoing work.
CSJ tests
The testing code has been adapted in a969b37 to allow to specify multiple honest peers in CSJ tests. CSJ tests are the first time we care of running multiple honest peers to check that header are downloaded from only one of them.
We also added a test in 1d7fc95 to check that headers are downloaded from at most one peer even in the presence of adversarial peers.
GSM integration
In b525959 Limit on Eagerness is enabled only while GSM is in syncing state, and 83b12ee does similarly for LoP.
f69fd1f makes possible to toggle genesis components with environment variables. This was helpful to measure the overheads of the individual components.